Decoding visemes: improving machine lipreading (PhD thesis)

نویسنده

  • Helen L. Bear
چکیده

This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem in both areas of speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking or; the parameters of the video recording for example, the video resolution. We begin our work with a literature review to understand the restrictions current technology limits machine lip-reading recognition and conduct an experiment into resolution affects. We show that high definition video is not needed to successfully lipread with a computer. The term “viseme” is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition ‘a viseme is a group of phonemes with identical appearance on the lips’. A phoneme is the smallest acoustic unit a human can utter. Because there are more phonemes per viseme, mapping between the units creates a many-to-one relationship. Many mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee’s [82] is best. Lee’s classification also outperforms machine lip-reading systems which use the popular Fisher [48] phoneme-to-viseme map. Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee’s. Our results show the sensitivity of phoneme clustering and we use our new knowledge

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Understanding the visual speech signal

For machines to lipread, or understand speech from lip movement, they decode lip-motions (known as visemes) into the spoken sounds. We investigate the visual speech channel to further our understanding of visemes. This has applications beyond machine lipreading; speech therapists, animators, and psychologists can benefit from this work. We explain the influence of speaker individuality, and dem...

متن کامل

Visual gesture variability between talkers in continuous visual speech

Recent adoption of deep learning methods to the field of machine lipreading research gives us two options to pursue to improve system performance. Either, we develop endto-end systems holistically or, we experiment to further our understanding of the visual speech signal. The latter option is more difficult but this knowledge would enable researchers to both improve systems and apply the new kn...

متن کامل

Classifying Visemes for Automatic Lipreading

Automatic lipreading is automatic speech recognition that uses only visual information. The relevant data in a video signal is isolated and features are extracted from it. From a sequence of feature vectors, where every vector represents one video image, a sequence of higher level semantic elements is formed. These semantic elements are “visemes” the visual equivalent of “phonemes” The develope...

متن کامل

Speech Recognition with Hidden Markov Models in Visual Communication

Speech is produced by the vibration of the vocal cords and the configuration of the arti-culators. Because some of these articulators are visible, there is an inherent relationship between the acoustic and the visual forms of speech. This relationship has been historically used in lipreading. Today's advanced computer technology opens up new possibilities to exploit the correlation between acou...

متن کامل

Persian Viseme Classification Using Interlaced Derivative Patterns and Support Vector Machine

Viseme (Visual Phoneme) classification and analysis in every language are among the most important preliminaries for conducting various multimedia researches such as talking head, lip reading, lip synchronization, and computer assisted pronunciation training applications. With respect to the fact that analyzing visemes is a language dependent process, we concentrated our research on Persian lan...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1710.01288  شماره 

صفحات  -

تاریخ انتشار 2017